Your browser doesn't support the features required by impress.js, so you are presented with a simplified version of this presentation.
For the best experience please use the latest Chrome, Safari or Firefox browser.
## The cost of
# MongoDB ACID transactions
## in theory and practice
.
Henrik Ingo
Senior Performance Engineer, MongoDB
Highload++ 2018
-----
# Agenda
* MongoDB features:
* write concern, read preference, read concern
* Durability cost
* Theory
* Practice
* Consistency level cost
* Theory
* Practice
-----
# Write Concerns (Durability)

-----
# Write Concern (Durability)
* `w:0`
* `w:1` (async)
* `j:true` (fsync)
* `w:2`
* `w:majority` (sync)
* `w:N`
* `w:majority + j:true`
-----
# Write Concerns Notes
* `w:0` really? Yes, really.
* Default: No durability
* Durability is client side option: Power to developers!
* Replication wins fsyncs for durability
* `writeConcernMajorityJournalDefault` - ugly duckling:
* server side
* fsync by default (safe by default, wtf?)
* `w:majority` and `w:2` behave different
* Follow Raft paper
-----
# Consistency (SQL: Isolation)
* readPreference
* primary (CP-ish)
* secondary (eventual consistency, AP)
* readConcern
* local (read uncommitted, AP)
* majority (read committed)
* linearizable (serializable)
* snapshot
* sessions
* causal consistency
* transactions (ACID)
* Atomic for multi-document, multi-statement
-----
# Write Concern cost

-----
# Write Concern cost (theory)
| setting | latency (theory) |
|:--------|-----------------:|
| `w:0` | 0 |
| **`w:1`** | 1 rtt |
| `w:2` | 2 rtt |
| `w:N` | 2 rtt |
| `j:true` | 1 fsync |
| `w:2 j:true` | 2 rtt + 1 fsync |
| `w:majority` *(1)* | 2 rtt + 1 fsync |
| `w:majority j:false` | 2 rtt |
| `w:majority j:true` | 2 rtt + 1 fsync |
*1) See [writeConcernMajorityJournalDefault](https://docs.mongodb.com/manual/reference/write-concern/#acknowledgement-behavior)*
-----
# Test cluster 1 (Same AZ+PG)

-----
# Simple update (single threaded)
db.hltest.update_one( {'_id': 1}, {'$inc': {'n': 1}} )
* Note: Several results would be different for multi-threaded high load test. This test gives insight into basic behavior of the implementation.
* [> Full benchmark code](single_threaded.py)
-----
# Write Concerns

-----
# Observations
* `j:true` is 2x slower than `w:1`. This is reasonable!
* `w:majority` slower than `j:true`!
* Reason: To ensure data integrity, oplog must be flushed before new entries become visible.
* Replication cannot possibly be faster than fsync.
* Using `j:true` actually makes replication faster (25%)
* Note: You still MUST use `w:majority` for durability. `j:true` does not ensure durability across cluster during failover.
-----
# Test cluster 2 (Multi-AZ)

-----
# Write Concerns

-----
# Observations
* Largely similar to single AZ results
* Use of SSL adds 0.15 ms to `w:0`
* fsync and replication latency dominate, so client2 is not disadvantaged except for `w:1`
-----
# Test cluster 3 (Multi-region)

-----
# Write Concerns

-----
# Observations
* Geographical latency dominates everything!
* Secondaries with uneven RTT, so `w:2` and `w:3` different
* Even `w:0` has its limits, average = 5 ms as TCP buffers fill up.
* Poor 99% percentile for client2 with `j:true w:2`
-----
# Atlas Latency Calculator

# How many
# **Isolation Levels**
# do you know?
# Consistency Levels

[(C) Kyle Kingsbury jepsen.io/consistency](https://jepsen.io/consistency)
-----
# Consistency Levels & MongoDB

-----
# Consistency Levels Details
_r:majority_ has no latency overhead, but has overhead on the _MVCC_ storage engine needing to keep older snapshots in RAM.
_Linearizeable_ consistency is implemented by turning reads into no-op writes with `w:majority`.
_Causal sessions_ are implemented by passing latest timestamp to clients. If all clients were to (telepathically) exchange these timestamps with each other, the result is equivalent to _Linearizable_ isolation.
MongoDB transactions require a session. Therefore transactions provide both _Snapshot Isolation_ and _Causal Consistency_.
-----
# Simple transaction (1 thread)
amount = random.randint(-100, 100)
result1 = db.hltest.find_one( { '_id': 1 } )
db.hltest.update_one( {'_id': 1}, {'$inc': {'n': -amount}} )
result2 = db.hltest.find_one( { '_id': 2 } )
db.hltest.update_one( {'_id': 2}, {'$inc': {'n': amount}} )
result = db.hltest.aggregate( [ { '$group': { '_id': 'foo', 'total' : { '$sum': '$n' } } } ] )
assert result['total'] == 200
* Note: Several results would be different for multi-threaded high load test. This test gives insight into basic behavior of the implementation.
* [> Full benchmark code](single_threaded.py)
-----
# Consistency levels (1)

-----
# Observations
* No latency overhead from `session`
* `session` is faster than `linearizable`
* Transaction a little faster than no transaction!
* There is only 1 `w:majority` roundtrip at `commitTransaction`
* No errors from invariant, yet...
* [PYTHON-1668](https://jira.mongodb.org/browse/PYTHON-1668): Transactions: Pymongo was ignoring default connection settings, must set `w` & `r` in `start_transaction()`
-----
# Consistency levels (2)

-----
# Observations
* Invariant errors on
* `w:1 r:local rp:secondary`
* `w:m r:m rp:secondary`
* No error when using session!
* (`linearizable` cannot read from secondary.)
-----
# Consistency levels (3)

-----
# Observations
* Same results as before, but more accentuated:
* Transacion now fully 2x faster than without
* Because there are 2 updates
* `readPreference:secondary` now clearly benefits client2
-----
# Summary of main findings
* To avoid eventual consistency effects, use `linearizable`, `session` or `transaction`.
* `readPreference:secondary` benefit in global clusters
* Note: For full ACID guarantees, use transaction.
* `session` is similar to `linearizable` but no latency overhead.
* `w:majority` was slower than `j:true`.
* You still must use `w:majority`.
* Transaction faster than without!
* (But `readConcern:snapshot` has more overhead on a large and busy server, so overall transactions may or may not be faster.)
-----
# Future
* Multi-shard transactions (4.2)
* Based on 2 Phase Commit.
* Needs 2x `w:majority` commit on each shard. Better make them faster :-)
* Remove need for fsync before oplog reads
* Transactions > 16 MB
* Similar test for readonly trx (test `readConcern:snapshot`)